Bayesian inference and multiple comparisons

From idealization of methods to conscious practice

Margherita Calderan

2025-12-02

About me 👋

  • Postdoctoral Researcher in Cognitive Psychology
  • Research interests:
    • Computational modeling of cognition and learning
    • Bayesian hypothesis testing (recently)
    • Meta-science


Have I ever considered multiplicity? No 😔

Hypothesis testing and errors

When testing a null hypothesis H_0 against an alternative H_1, we can make two types of errors.


Reality     Reject H_0                       Fail to reject H_0
H_0 true    Type I Error (False Positive)    Correct decision
H_1 true    Correct decision                 Type II Error (False Negative)


We (usually) control the Type I error rate at a pre-specified level \alpha (typically 0.05).

Frequentist hypothesis testing

You compare the scores of the control and experimental groups and obtain a p-value of 0.05.


If, hypothetically, the experiment were repeated a great number of times when the null hypothesis is true, we would falsely reject the null hypothesis about 5% of the time.

Bayesian hypothesis testing

After comparing the scores of the control and experimental groups, you find that the posterior probability that the groups differ is 95%.


Based on the data and your prior, there is a 95% probability that the groups differ. If you act as though they truly differ, there is a 5% chance that this decision is wrong.

Bayesian methods in psychology

(Heck et al., 2023; Jevremov & Pajić, 2024; Van De Schoot et al., 2017)

Multiplicity

Multiplicity (or the multiple testing problem) arises when we test more than one hypothesis.


  • I test the effect of the manipulation on both working memory and inhibitory control.

  • I test the effect of the manipulation and its interaction with age on working memory.

  • I run post hoc tests to examine differences across four time-points.

Why multiplicity creates problems

When testing m hypotheses independently, each at Type I error rate \alpha, the overall probability of at least one false positive (the family-wise error rate, FWER) grows rapidly with m.
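
This growth is just 1 - (1 - \alpha)^m. A minimal Python sketch:

```python
def fwer(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 10, 20):
    print(f"m = {m:2d}: FWER = {fwer(m):.3f}")
# m = 20 already gives FWER of about 0.64
```

With twenty independent tests at \alpha = 0.05, the chance of at least one false positive is roughly 64%.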

Multiplicity in Bayes

Bayes Factor

  1. It measures the probability that the null hypothesis is true given the data.
  2. It measures the probability that the alternative hypothesis is true given the data.
  3. It is the ratio of how likely the observed data are under one hypothesis compared to another.
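
The third statement is the definition: the Bayes factor is a ratio of (marginal) likelihoods. A minimal numeric sketch with two point hypotheses and invented data — for composite hypotheses, each likelihood would be averaged over that hypothesis's prior:

```python
import math

def normal_pdf(x: float, mu: float, sd: float = 1.0) -> float:
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical data; two *point* hypotheses about a normal mean (sd known = 1).
y = [0.4, 0.9, 0.1, 0.7]
lik_h1 = math.prod(normal_pdf(x, mu=0.5) for x in y)   # p(data | H1: mu = 0.5)
lik_h0 = math.prod(normal_pdf(x, mu=0.0) for x in y)   # p(data | H0: mu = 0)
bf10 = lik_h1 / lik_h0   # how much better H1 predicts the data than H0
print(f"BF10 = {bf10:.2f}")   # prints BF10 = 1.73
```

BF10 > 1 means the data are more likely under H1 than under H0; it is not the probability that either hypothesis is true.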

Bayes Factor

From Keysers et al. (2020)

Bayes Factor and multiple testing

\text{Pr(at least one } BF \geq 3)

N.B. Results change depending on priors and threshold.

Posterior odds and multiple testing

Researchers often interpret BF as posterior odds (Hoijtink et al., 2016; Tendeiro et al., 2024; Tendeiro & Kiers, 2019).


\text{Posterior odds, } \frac{Pr(\mathcal{H_1} | data)}{Pr(\mathcal{H_0} | data)} = \frac{Pr(\mathcal{H_1})}{Pr(\mathcal{H_0})} \times \frac{p(data | \mathcal{H_1})}{p(data | \mathcal{H_0})}

Multiplicity adjustment can be incorporated by modifying the prior odds \frac{Pr(\mathcal{H}_1)}{Pr(\mathcal{H}_0)} based on the number of hypotheses tested, which produces posterior odds that control for multiple testing.
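
A sketch of this adjustment in Python, with a hypothetical Bayes factor and number of tests; the per-test prior follows the Jeffreys-style allocation discussed later in these slides:

```python
m = 5            # hypothetical number of hypotheses tested
bf10 = 3.0       # hypothetical Bayes factor for one of them

# Unadjusted: equal prior odds (1:1) pass the Bayes factor through unchanged.
post_odds_unadjusted = 1.0 * bf10

# Adjusted: give each alternative the Jeffreys-style prior 1 - 0.5**(1/m),
# so that the global null keeps probability 0.5 across all m tests.
p_h1 = 1 - 0.5 ** (1 / m)
prior_odds = p_h1 / (1 - p_h1)
post_odds_adjusted = prior_odds * bf10
print(f"posterior odds: unadjusted {post_odds_unadjusted:.2f}, "
      f"adjusted {post_odds_adjusted:.2f}")
# prints posterior odds: unadjusted 3.00, adjusted 0.45
```

The same Bayes factor that looks like evidence for H1 under 1:1 prior odds yields posterior odds below 1 once the prior is spread over five tests.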

Bayesian estimation

The prior (grey) represents our initial guess about plausible values; after observing data, it updates to the posterior (blue), which narrows around the most likely value.

Bayesian estimation concrete example

Suppose we want to estimate the Stroop effect (typically 0.060–0.120 s). We can adopt priors centered at zero with varying standard deviations (τ = 0.1, 0.25, 0.5) to reflect different degrees of skepticism: the smaller τ, the more skeptical the prior is of large effects.
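
The effect of the prior can be sketched with a normal-normal conjugate update (known data sd); the observed mean and standard error below are hypothetical numbers for a Stroop effect in seconds:

```python
def posterior(tau: float, obs_mean: float = 0.090, obs_se: float = 0.020):
    """Posterior mean and sd combining a N(0, tau^2) prior with a normal likelihood."""
    prior_prec = 1 / tau ** 2      # precision of the zero-centered prior
    data_prec = 1 / obs_se ** 2    # precision of the likelihood
    mean = data_prec * obs_mean / (prior_prec + data_prec)
    sd = (prior_prec + data_prec) ** -0.5
    return mean, sd

for tau in (0.5, 0.25, 0.1):       # smaller tau = more skeptical prior
    mean, sd = posterior(tau)
    print(f"tau = {tau}: posterior mean = {mean:.4f}, sd = {sd:.4f}")
```

The tighter the prior around zero, the more the posterior mean is pulled from the observed 0.090 toward zero.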


Probability of Direction

p_d = \max(Pr({\hat{\beta}} < 0), Pr({\hat{\beta}} > 0))


  • Measures the certainty of an effect’s direction (positive or negative).
  • Typically ranges from 0.5 to 1.
  • \text{p-value} \approx 2 \times (1 - p_d)
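
Computing p_d from posterior draws is a one-liner; the draws below are hypothetical stand-ins for MCMC output:

```python
import random
random.seed(1)

# Hypothetical posterior draws for a regression coefficient (e.g. from MCMC).
draws = [random.gauss(0.04, 0.02) for _ in range(10_000)]

p_positive = sum(d > 0 for d in draws) / len(draws)
p_d = max(p_positive, 1 - p_positive)   # probability of direction
approx_p = 2 * (1 - p_d)                # rough two-sided p-value analogue
print(f"p_d = {p_d:.3f}, approximate p-value = {approx_p:.3f}")
```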

Probability of Direction and Multiplicity

\text{FWER} = \Pr(\text{at least one } p_d \geq 0.975)
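
Under a normal model with a flat prior, the posterior given z \sim N(\theta, 1) is N(z, 1), so p_d \geq 0.975 coincides with |z| \geq 1.96 and the usual family-wise inflation reappears. A small Monte Carlo sketch (m and the number of replications are arbitrary choices):

```python
import random
from statistics import NormalDist

random.seed(7)
m, reps = 5, 20_000
z_crit = NormalDist().inv_cdf(0.975)   # about 1.96

# With a flat prior, p_d = Phi(|z|), so p_d >= 0.975 is equivalent to |z| >= 1.96.
hits = sum(
    any(abs(random.gauss(0, 1)) >= z_crit for _ in range(m))
    for _ in range(reps)
)
print(f"simulated FWER = {hits / reps:.3f} (analytic: {1 - 0.95 ** m:.3f})")
```

The simulation recovers the frequentist 1 - 0.95^m: using p_d as a decision rule inherits the multiplicity problem.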

The joint posterior approach

You want to test whether an intervention improves verbal and/or visuospatial abilities, and to identify which domains improve. Improvement in at least one domain is more probable than improvement in both.


It is important, of course, not to make the elementary mistake of supposing that joint probability statements can generally be made using the marginal distributions (Box & Tiao, 1992, p. 122).

From Box & Tiao (1992)
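
One way to see the point of the quote: with correlated posterior draws, the joint probability that both effects are positive differs from the product of the marginals. A sketch with invented draws (the means, scale, and correlation are hypothetical):

```python
import random
random.seed(3)

# Hypothetical correlated posterior draws for two effects (verbal, visuospatial).
rho, n = 0.8, 50_000
draws = []
for _ in range(n):
    z1 = random.gauss(0, 1)
    z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)
    draws.append((0.1 + 0.2 * z1, 0.1 + 0.2 * z2))

p_verbal = sum(a > 0 for a, _ in draws) / n
p_visuo = sum(b > 0 for _, b in draws) / n
p_both = sum(a > 0 and b > 0 for a, b in draws) / n   # joint, NOT p_verbal * p_visuo
p_any = sum(a > 0 or b > 0 for a, b in draws) / n
print(f"joint = {p_both:.3f}, product of marginals = {p_verbal * p_visuo:.3f}, "
      f"any = {p_any:.3f}")
```

Because the effects are positively correlated, the joint probability exceeds the product of the marginals, and "at least one improves" is always at least as probable as "both improve".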

Example of hypothesis testing


\theta_1 = verbal abilities
\theta_2 = visuospatial abilities


vertical line: \theta_1 = 0
horizontal line: \theta_2 = 0
diagonal: \theta_1 = \theta_2
intersection: \theta_1 = \theta_2 = 0

Questions for You


Who here has used Bayesian methods? What was your experience?


What are the main barriers to correct statistical understanding and practice in your field? What would help us improve?


We often test multiple hypotheses in a single study. What drives this practice in your research? Do the benefits outweigh the costs?

H_0\!: \delta = 0, \qquad H_1\!: \delta \sim \text{Cauchy}(0, \sqrt{2}/2)

Between-models priors (Jeffreys, 1938)


  • m alternative hypotheses.

  • Let \mathcal{H}_A be the event that at least one alternative hypothesis is true.

  • We assign equal prior probability to the event \mathcal{H}_A and to its complement \mathcal{H}_0 (the global null: all alternatives false):

Pr(\mathcal{H}_0) = Pr(\mathcal{H}_A) = \frac{1}{2}


The probability that all alternatives are false:

(1 - Pr(h_{1_i}))^m


Since this equals the global null probability:

(1 - Pr(h_{1_i}))^m = 0.5


Solving gives the prior of each alternative hypothesis:

Pr(h_{1_i}) = 1 - 0.5^{1/m}


and the prior for each null hypothesis:

Pr(h_{0_i}) = 0.5^{1/m}


This ensures that the probability of all nulls being true simultaneously and the probability of at least one alternative being true both equal 0.5.
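
The allocation above is just the formula Pr(h_{1_i}) = 1 - 0.5^{1/m}; a quick check that the global null stays at 0.5:

```python
def per_test_prior(m: int) -> float:
    """Jeffreys-style per-alternative prior so that Pr(all m nulls true) = 0.5."""
    return 1 - 0.5 ** (1 / m)

for m in (1, 2, 5, 10):
    p1 = per_test_prior(m)
    print(f"m = {m:2d}: Pr(h1_i) = {p1:.3f}, Pr(all nulls) = {(1 - p1) ** m:.3f}")
# Pr(all nulls) is 0.500 for every m
```

As m grows, each individual alternative receives a smaller prior probability, which is exactly the multiplicity penalty.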

Westfall’s approach for dependent tests

For multiple comparisons among m means, Westfall et al. (1997) developed a solution for the \binom{m}{2} possible pairwise comparisons involving \mu_1, \mu_2, \ldots, \mu_m, where \theta_{ij} represents the difference \mu_i - \mu_j. Their hierarchical prior structure specifies:

\mu_i \begin{cases} \equiv \mu & \text{with probability } \lambda, \\ \sim G & \text{with probability } 1-\lambda, \end{cases}

with G(\cdot) being a continuous distribution.

Under this specification, the probability that any two specific means equal \mu is \lambda^2, and \text{Pr}(\text{all }\theta_{ij} = 0) = \lambda^m. This dependency structure yields a modified adjustment k = 1 - 0.5^{2/m}, replacing the independence-based correction (and resembling the mixture models described earlier).

Note: Implemented in JASP for post-hoc testing (for details, see de Jong, 2019).
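
A small sketch comparing the dependence-aware weight above with the independence-based allocation applied naively to all \binom{m}{2} pairwise tests (just the two formulas evaluated side by side):

```python
from math import comb

def westfall_k(m: int) -> float:
    """Dependence-aware per-comparison weight for m means: k = 1 - 0.5**(2/m)."""
    return 1 - 0.5 ** (2 / m)

def independence_k(n_tests: int) -> float:
    """Independence-based analogue for n_tests tests: 1 - 0.5**(1/n_tests)."""
    return 1 - 0.5 ** (1 / n_tests)

for m in (3, 5, 10):
    n_pairs = comb(m, 2)   # number of pairwise comparisons among m means
    print(f"m = {m:2d}: {n_pairs:2d} pairwise tests, "
          f"Westfall k = {westfall_k(m):.3f}, "
          f"independence k = {independence_k(n_pairs):.3f}")
```

Because the pairwise comparisons share the same means, the dependence-aware k penalizes each comparison less than treating all \binom{m}{2} tests as independent would.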

Bayesian estimation prior precision

References

Box, G. E. P., & Tiao, G. C. (1992). Bayesian inference in statistical analysis (1st ed.). Wiley. https://doi.org/10.1002/9781118033197
Heck, D. W., Boehm, U., Böing-Messing, F., Bürkner, P.-C., Derks, K., Dienes, Z., Fu, Q., Gu, X., Karimova, D., Kiers, H. A., et al. (2023). A review of applications of the Bayes factor in psychological research. Psychological Methods, 28(3), 558.
Hoijtink, H., Kooten, P. van, & Hulsker, K. (2016). Why Bayesian psychologists should change the way they use the Bayes factor. Multivariate Behavioral Research, 51(1), 2–10.
Jeffreys, H. (1938). Significance tests when several degrees of freedom arise simultaneously. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 165(921), 161–198. https://doi.org/10.1098/rspa.1938.0052
Jevremov, T., & Pajić, D. (2024). Bayesian method in psychology: A bibliometric analysis. Current Psychology, 43(10), 8644–8654.
Keysers, C., Gazzola, V., & Wagenmakers, E.-J. (2020). Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nature Neuroscience, 23(7), 788–799.
Tendeiro, J. N., & Kiers, H. A. (2019). A review of issues about null hypothesis Bayesian testing. Psychological Methods, 24(6), 774.
Tendeiro, J. N., Kiers, H. A., Hoekstra, R., Wong, T. K., & Morey, R. D. (2024). Diagnosing the misuse of the Bayes factor in applied research. Advances in Methods and Practices in Psychological Science, 7(1), 25152459231213371.
Van De Schoot, R., Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017). A systematic review of Bayesian articles in psychology: The last 25 years. Psychological Methods, 22(2), 217.
Westfall, P. H., Johnson, W. O., & Utts, J. M. (1997). A bayesian perspective on the bonferroni adjustment. Biometrika, 84(2), 419–427. http://www.jstor.org/stable/2337467